I loaded the given 2D-3D correspondences, split them into 2D and 3D points, and projected and normalized them.
The points are then stacked to form the linear equation $Ap = 0$, where $p$ is the flattened camera matrix $P$.
The linear equation is derived as follows.
The $A$ matrix is formed by stacking the $3\times12$ blocks contributed by the individual 2D-3D correspondences.
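One common way to arrive at such a system is the cross-product form of the projection constraint. This is a sketch under assumptions: the notation $x_i = (u_i, v_i, w_i)^T$ for a homogeneous image point, $X_i$ for the homogeneous 3D point, and $p_1^T, p_2^T, p_3^T$ for the rows of $P$ is introduced here for illustration. Since $x_i$ and $PX_i$ are equal up to scale, $x_i \times PX_i = 0$, which can be written as

$$
\begin{bmatrix}
0^T & -w_i X_i^T & v_i X_i^T \\
w_i X_i^T & 0^T & -u_i X_i^T \\
-v_i X_i^T & u_i X_i^T & 0^T
\end{bmatrix}
\begin{bmatrix} p_1 \\ p_2 \\ p_3 \end{bmatrix} = 0
$$

Each block is $3\times12$, but only two of its rows are linearly independent ($u_i$ times the first row plus $v_i$ times the second plus $w_i$ times the third is zero), so at least six correspondences are needed to determine $p$ up to scale.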
After forming $A$, I obtain the vector $p$ from an SVD of $A$: $p$ is the right singular vector associated with the smallest singular value.
The $P$ matrix is recovered by reshaping the 12-element vector $p$ into a $3\times4$ matrix.
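The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the exact code used here: the function name `estimate_camera_matrix`, the row-major flattening of $P$, the choice $w = 1$ for the image points, and the final scaling that makes the last entry of $P$ equal to 1 are all assumptions.

```python
import numpy as np

def estimate_camera_matrix(X_world, x_image):
    """Estimate a 3x4 camera matrix from n >= 6 correspondences.

    X_world: (n, 3) array of 3D points, x_image: (n, 2) array of pixels.
    """
    n = X_world.shape[0]
    Xh = np.hstack([X_world, np.ones((n, 1))])  # homogeneous 3D points, (n, 4)
    zero = np.zeros(4)
    rows = []
    for (u, v), X in zip(x_image, Xh):
        # 3x12 block from x cross (P X) = 0, with w = 1 (assumption)
        rows.append(np.concatenate([zero, -X, v * X]))
        rows.append(np.concatenate([X, zero, -u * X]))
        rows.append(np.concatenate([-v * X, u * X, zero]))
    A = np.asarray(rows)                        # (3n, 12)

    # Right singular vector of the smallest singular value solves A p = 0
    _, _, Vt = np.linalg.svd(A)
    p = Vt[-1]

    P = p.reshape(3, 4)                         # row-major reshape (assumption)
    return P / P[2, 3]                          # fix scale; assumes P[2, 3] != 0
```

Because $p$ is only defined up to scale, some normalization is needed before comparing matrices; dividing by the last entry is one option and fails if that entry is close to zero.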
The camera matrix is then used to project the given surface points onto the image plane using $x = PX$.
Similarly, for the bounding box, the outer points are projected, normalized, and then transformed to the image plane.
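The projection step can be sketched as below. The helper names `project_points` and `box_corners` are hypothetical, and the sketch assumes the bounding box is axis-aligned and given by its minimum and maximum corners, which may differ from how the data is actually stored.

```python
import itertools
import numpy as np

def project_points(P, X_world):
    """Apply x = P X to (n, 3) points and return (n, 2) pixel coordinates."""
    n = X_world.shape[0]
    Xh = np.hstack([X_world, np.ones((n, 1))])  # homogeneous 3D points
    xh = Xh @ P.T                               # homogeneous image points, (n, 3)
    return xh[:, :2] / xh[:, 2:3]               # divide by the third coordinate

def box_corners(lo, hi):
    """Return the 8 corners of an axis-aligned box as an (8, 3) array."""
    return np.array(list(itertools.product(*zip(lo, hi))), dtype=float)
```

For example, `project_points(P, box_corners(lo, hi))` would give the eight image-plane corners, which can then be joined by line segments to draw the box.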
The camera matrix came out as
$P = \begin{bmatrix} 6509.45644 & -2932.52813 & 1067.13749 & 2227.53498 \\ -860.882948 & -6737.15135 & 1961.29030 & 1821.45602 \\
0.622451352 & -1.41369464 & -0.789678657 & 1.00000000 \end{bmatrix}$